Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium
Abstract
The Linguistic Data Consortium (LDC) is a non-profit consortium of universities, companies and government research laboratories that supports education, research and technology development in language-related disciplines by collecting or creating, distributing and archiving language resources, including data and accompanying tools, standards and formats. LDC was founded in 1992 with a grant from the Defense Advanced Research Projects Agency (DARPA) to the University of Pennsylvania as host organization. LDC publication and distribution activities are self-supporting through membership fees and data sales, while new data creation is supported primarily by grants from DARPA and the National Science Foundation. Recent developments in the creation and use of language resources demand new roles for international data centers. Since our report at the last Language Resource and Evaluation Conference in Granada in 1998, LDC has observed growth in the demand for language resources along multiple dimensions: larger corpora with more sophisticated annotation in a wider variety of languages are used in an increasing number of language-related disciplines. There is also increased demand for reuse of existing corpora. Most significantly, small research groups are taking advantage of advances in microprocessor technology, data storage and internetworking to create their own corpora. This has led to the birth of new annotation practices whose very variety creates barriers to data sharing. This paper describes recent LDC efforts to address emerging issues in the creation and distribution of language resources.
1. The Value of Language Resources
Developing realistic models of human language that support research and technology development in language-related fields requires masses of linguistic data: preferably hundreds of hours of speech, tens of millions of words of text and lexicons of a hundred thousand words or more.
Although independent researchers and small research groups now have the desktop capacity to create small- to medium-scale corpora, the collection, annotation and distribution of resources on a larger scale present not only computational difficulties but also legal and logistical difficulties that challenge most research organizations, whether educational, commercial or governmental. While some large corporate research groups routinely engage in medium- to large-scale corpus creation, these groups typically lack the necessary distribution infrastructure; resources created at considerable cost in those environments are seldom shared outside the immediate group.
Published language resources benefit a broad spectrum of researchers, technology developers and their customers. The presence of community standard resources reduces duplication of effort, distributes production costs and removes a barrier to entry. As research communities mature, published resources are corrected, improved and further annotated. They provide a stable reference point for the comparison of different analytic approaches.
Over the past two decades, the situation for language engineers has evolved from one in which concerns over intellectual property, usage agreements, publication standards and replication costs prevented resource sharing to the current state in which the value of shared resources is widely recognized. Based on the success of the DARPA “common task” methodology and the popularity of early shared databases such as the Brown text corpus and Texas Instruments’ TI 46 and TI DIGITS corpora, the LDC was created to foster the development, distribution, archiving and maintenance of language resources in electronic form.
2. The Linguistic Data Consortium
The Linguistic Data Consortium was founded in 1992 with an initial grant from the Defense Advanced Research Projects Agency (DARPA) and continuing funding from DARPA and the National Science Foundation (NSF).
The University of Pennsylvania serves as the LDC’s host institution, providing space, infrastructure and logistical support. LDC staffers are University employees, and Penn enters into all legal arrangements on behalf of the consortium members and the research community at large.
From the beginning, LDC has built fruitful links with groups in Europe, Asia and other parts of the world. In Europe, the principal partner has been ELRA, but hundreds of European organizations are also LDC members. More than one hundred Asian organizations use LDC data, and LDC has maintained links with the emerging GSK group in Japan. Because progress in several areas of language technology depends upon the accessibility of the consortium’s products, LDC is open to researchers around the world.
Before LDC was created, an external planning committee set the membership fees, which have not changed in 8 years. The membership fee for a university is roughly the cost of a new PC or attendance at an international technical meeting. The membership fee for a commercial organization is roughly the cost of a high-end workstation, certainly less than the cost to create a single small-scale corpus and an order of magnitude less than the cost of an average medium-scale corpus. As a matter of policy, no bona fide researcher is prevented from having access to LDC data by genuine inability to pay. Organizations join LDC on a yearly basis and gain perpetual rights to all corpora produced in the years in which they join. Current LDC members also have network access to LDC Online, a service that facilitates browsing and searching of indexed text, speech and lexical corpora.
Since 1992, nearly 1000 organizations worldwide have used LDC data; more than 300 companies, universities and government research laboratories have joined the consortium; almost 700 others have purchased one or more corpora. Of all the organizations that use LDC data, about half are American.
Europeans make up a third of the user base, with the remaining groups hailing from Asia, the Middle East, Africa and Australia.
As required by the terms of LDC’s founding grant, membership fees and data sales provide funding to support the consortium’s ongoing activities in the publication, documentation, maintenance and distribution of databases as well as the negotiation of necessary legal arrangements, plus a small amount of new database creation. Often, databases created elsewhere must be extensively transformed to make them suitable for electronic publication; this work is also carried out at the LDC with internal financing.
Responding to demands from its constituent research communities, LDC has expanded its role from that of a specialized data publisher to include data collection, corpus creation, and research on the use and structure of language resources. LDC staff has grown accordingly. Twenty regular staffers manage the research, technical, collection, annotation, publication and customer service functions of LDC’s Philadelphia office. LDC also maintains a part-time workforce that varies from 10 to 35 staffers depending upon project workload.
3. LDC Data Publication
The primary function of the LDC is the publication and archiving of data resources. LDC publishes most corpora on digital media, currently CD-ROM. The sheer volume of some corpora calls for distribution on a denser medium. LDC expects to publish its data on DVD, or some future innovation, once the market penetration of that medium is sufficient to make it cost-effective. Every corpus, unless prevented by intellectual property agreements, becomes available for network access via LDC Online as well.
3.1. Distribution on Digital Media
Each year LDC publishes between 15 and 25 corpora. About half come from outside organizations that have collected and annotated data on their own but asked LDC to assist with the final formatting, intellectual property arrangements and distribution.
The other half are corpora created by or with the help of LDC. The latter typically support government-sponsored technology evaluation projects. About two-thirds of LDC’s publications are speech corpora; the remaining third are text corpora and lexicons. At the time of writing, LDC had published 164 corpora including 106 speech corpora, 48 text corpora and 10 lexicons. Some of the most recent include:
Topic Detection and Tracking Corpus: richly annotated broadcast news and newswire in English and Mandarin, described further below
Treebank 3: the latest update of the landmark hand-parsed corpus of written and conversational English
Corpus of Spoken American English: collected by the University of California, Santa Barbara Center for the Study of Discourse (John W. Du Bois, Director) and representing the American Component of the International Corpus of English (Charles W. Meyer, Director)
BLLIP 1987-89 WSJ Corpus: a Treebank-style parsing of the three-year, 30 million word Wall Street Journal archive from the ACL/DCI corpus developed by Eugene Charniak and his group at Brown University
Speech Under Simulated and Actual Stress (SUSAS): created by the Robust Speech Processing Laboratory at Duke University (Professor John H. L. Hansen, Director)
Taiwanese Putonghua: 40 transcribed monologues and dialogues in Taiwanese-accented Putonghua gathered by San Duanmu at the University of Michigan
American English Spoken Dictionary: recorded pronunciations of 50,000 of the most common English words
JURIS: the database of the Justice Department Retrieval and Inquiry System, containing almost 700,000 legal documents from the 1700s through the early 1990s covering Administrative, Case, Statutory and Tax Law, plus Executive Orders, Regulations, International Agreements, etc.
Voicemail Corpus: over 1800 messages contributed by IBM volunteers and collected and transcribed by M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P. S. Gopalakrishnan and C. Dunn at IBM
Broadcast News audio and transcripts in English, Mandarin and Spanish
News text corpora in English, Mandarin, Japanese, Portuguese and Spanish
Conversational audio and transcripts in Egyptian Colloquial Arabic, English, German, Mandarin and Spanish
Pronouncing lexicons in Egyptian Colloquial Arabic, English and German
This is just a sampling of LDC publications since the last LREC report. For a complete listing, readers are encouraged to visit the LDC catalog at: www.ldc.upenn.edu/Catalog
Figure 1: Organizations that use LDC by country. Marks indicate country but not exact location.
3.2. Network Access to LDC Data
LDC’s most common mode of publication has been to organize data on one or more volumes of computer-readable media in standard or de facto standard formats and distribute these upon request. This mode works best for research groups that already know which data sets they require and have the local infrastructure to handle the media and formats and to process the data in large quantities. Other research communities, however, take an exploratory approach, testing hypotheses on small batches of carefully selected data. For these groups and communities, LDC Online provides more useful access.
LDC Online provides network access to LDC’s text, audio and lexical resources that are not otherwise restricted by intellectual property arrangements. With LDC Online, users may browse resources linearly, or search text resources by word, lemma, part of speech or any combination of these elements. Statistics such as word frequency are also available. For corpora containing audio data and transcripts, a search against the transcripts also returns a link to the audio. To facilitate network access, LDC transcripts are typically aligned to the audio in small segments (8-10 seconds) and are available in any of the currently popular audio formats.
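The kind of combined search described above can be illustrated with a small sketch. It is not LDC Online's implementation, which this paper does not describe; the `Token` and `Segment` types, the file name `call_0001.wav`, the tag set and the sample data are all hypothetical and chosen only to show how a query over word, lemma and part of speech could return pointers into time-aligned audio.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    word: str
    lemma: str
    pos: str  # part-of-speech tag (hypothetical tag set)


@dataclass(frozen=True)
class Segment:
    audio_file: str  # hypothetical audio file name
    start: float     # segment start, in seconds
    end: float       # segment end, in seconds
    tokens: Tuple[Token, ...]


def matches(token: Token, word: Optional[str] = None,
            lemma: Optional[str] = None, pos: Optional[str] = None) -> bool:
    """True if the token satisfies every criterion that was supplied."""
    return ((word is None or token.word.lower() == word.lower())
            and (lemma is None or token.lemma == lemma)
            and (pos is None or token.pos == pos))


def search(segments: List[Segment], word: Optional[str] = None,
           lemma: Optional[str] = None,
           pos: Optional[str] = None) -> List[Tuple[str, float, float]]:
    """Return (audio_file, start, end) for each segment containing a match."""
    return [(s.audio_file, s.start, s.end) for s in segments
            if any(matches(t, word, lemma, pos) for t in s.tokens)]


def word_frequencies(segments: List[Segment]) -> Counter:
    """Count case-folded word forms across all segments."""
    return Counter(t.word.lower() for s in segments for t in s.tokens)


# Invented sample data: two short segments of one recording.
SEGMENTS = [
    Segment("call_0001.wav", 0.0, 8.5, (
        Token("She", "she", "PRP"), Token("calls", "call", "VBZ"),
        Token("home", "home", "NN"))),
    Segment("call_0001.wav", 8.5, 17.9, (
        Token("The", "the", "DT"), Token("calls", "call", "NNS"),
        Token("were", "be", "VBD"), Token("long", "long", "JJ"))),
]

if __name__ == "__main__":
    # Combine lemma and part of speech: only the verbal use of "call".
    print(search(SEGMENTS, lemma="call", pos="VBZ"))
    print(word_frequencies(SEGMENTS)["calls"])
```

Because each hit carries the audio file name and the segment boundaries, a client could fetch only the few seconds of audio it needs, which is one reason short aligned segments suit network access.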
The American English Spoken Lexicon (AESL) provides an example of audio and lexical data combined with fine-grained indexing to create an Internet resource for linguists, language teachers and others with similar needs. AESL audio files contain pronunciations of each of the 50,000 most common English words as counted in LDC’s English corpora. AESL is freely available (http://www.ldc.upenn.edu/cgi-bin/aesl/aesl) from LDC’s web site, where users can either browse the lexicon alphabetically or search for words by spelling or pronunciation.
LDC encourages use of the resources in LDC Online for research and education. Users or potential users with questions should contact [email protected]. Visitors to our web site who are not LDC members may acquire guest accounts that permit the same kinds of electronic access to a large sample of our data, including the Brown text corpus and the TIMIT speech corpus.
4. LDC Data Creation
Over the past several years, LDC has become increasingly involved in the collection and annotation of language resources. Although this was not one of the functions originally envisioned for the consortium, the needs of several research communities for large-scale corpus creation and LDC’s success at managing such efforts have combined to make this a productive partnership. The following are data creation projects currently underway or recently completed by LDC staff.
4.1. Telephone Conversations
LDC has managed three types of telephone collection projects: CallHome, CallFriend and Switchboard-2. The CallHome project supports large vocabulary conversational speech recognition by collecting, transcribing and providing lexical resources for a number of languages: Spanish, Japanese, Mandarin, English, German and Egyptian Arabic.
In each case, we have recorded over 200 30-minute conversations involving pairs of native speakers, transcribed the best 10 minutes and created lexical entries including pronunciation and, where appropriate, romanization and morphological analysis for each word in the transcripts.
The CallFriend project supports research in language identification for: Arabic, Canadian French, English (from both northern and southern US states), Farsi, German, Hindi, Japanese, Korean, Mandarin (from mainland China and Taiwan), Russian, Spanish (from the Caribbean and South America), Tamil and Vietnamese. In each case, we have collected 100 5-30 minute conversations from native speaker pairs living in the continental United States, Canada, Puerto Rico and the Dominican Republic. Although most of these calls have not yet been transcribed, there is a growing body of transcription for Spanish, Mandarin, Farsi, Korean and Russian. The Spanish and Mandarin transcripts appear in the LDC catalog; LDC will publish the Farsi, Korean and Russian after their use in evaluation projects.
In 1999, LDC began to collect and transcribe Russian and Korean telephone calls. For Russian, LDC collected over 140 telephone conversations among native Russian speakers living in the United States and segmented and transcribed 15 minutes of each of 80 conversations. For Korean, LDC staffers created time-aligned transcriptions for 15 minutes of each of 100 calls originally collected under the 1996 CallFriend project.
The Switchboard-2 corpus supports research and development in speaker identification technology. In each phase we collect, on average, 10 5-minute conversations from each of several hundred American English speakers. The subjects do not know each other and are matched in unique pairings and given a topic by the robot operator.
In late 1999, LDC began to collect a small corpus of conversations among cellular phone users.
The collection is still underway, with the goal of collecting 10 conversations from each of 190 participants. In 2000, LDC plans to continue transcribing its Farsi data, and to continue collecting and begin transcribing cellular phone conversations, to support both speaker identification and large vocabulary conversational speech recognition over digital cellular phone channels.
Publication date: 2000